Data Import & Analysis
In this section, we load and explore the data in order to decide how to proceed.
Data Import
We import the dataset SP500_data.csv and work on a copy
named data. Working on a copy ensures that
we do not make any changes to the original dataset.
We use several
libraries to process the tasks and produce the required output.
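A minimal sketch of this import-and-copy step in base R. The temporary file and its two columns are hypothetical stand-ins so that the example runs on its own; they are not the actual dataset. Reading the identifier column as text is an assumption made here to avoid losing digits in very large IDs.

```r
# Hypothetical stand-in for the real CSV so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id_str = c("1", "2"), retweet_count = c(0L, 3L)),
  csv_path, row.names = FALSE
)

# Import the file once; read the identifier as text, not as a number
raw_data <- read.csv(csv_path, colClasses = c(id_str = "character"))

# Work on a copy: R copies on modification, so raw_data stays unchanged
data <- raw_data
data$retweet_count <- data$retweet_count + 1L
```

After the last line, `raw_data` still holds the imported values, while `data` carries the modification.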
Data Exploration
This section gives a concise overview of the tweets posted by the Swiss
university social media accounts.
The dataset consists of 19,575
observations and 14 variables:
Time Range and Tweet Frequency:
- Tweets range from September 29, 2009, to January 26, 2023, which indicates long-term use of Twitter.
- The median tweet date is April 13, 2018, so half of the tweets were posted after this date. The mean (December 9, 2017) lies before the median, so the distribution has a long tail of older tweets.
Retweet and Favorite Counts:
- Retweets range from 0 to 267 and likes from 0 to 188 per tweet.
- The first quartile is 0 for both retweets and likes, and the median is 1 for retweets and 0 for likes, indicating that many tweets receive little to no engagement.
- The in_reply_to_screen_name field shows that some tweets are replies to other users, which might indicate engagement or conversation strategies used by the universities.
ID and String Variables:
- The id and id_str fields are technical identifiers for tweets. Their values indicate that the collected tweets span a wide range of tweet IDs.
Language and University Fields:
- The lang field shows the language used in each tweet.
- The university field shows the abbreviation of the university.
Temporal Patterns:
- created_at, tweet_date, tweet_hour, and tweet_month provide detailed temporal data that can be analyzed to understand peak times of activity and seasonal or monthly trends in tweeting behavior.
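The temporal fields described above can be derived from a single timestamp. The following base R sketch uses three hypothetical timestamps and assumes UTC; how the original columns were actually built is not shown in this report, so this illustrates the idea rather than the method used.

```r
# Hypothetical timestamps standing in for the created_at column
created_at <- as.POSIXct(
  c("2023-01-20 17:17:32", "2023-01-13 07:52:01", "2018-04-13 13:26:56"),
  tz = "UTC"
)

# Truncate each timestamp to the day, the hour, and the first of the month
tweet_date  <- as.Date(created_at, tz = "UTC")
tweet_hour  <- as.POSIXct(trunc(created_at, units = "hours"))
tweet_month <- as.Date(format(created_at, "%Y-%m-01", tz = "UTC"))

# Hour of day as a number, useful for finding peak posting times
hour_of_day <- as.integer(format(created_at, "%H", tz = "UTC"))
table(hour_of_day)
```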
Content Analysis
The word cloud represents the most frequently used words in the filtered tweets with high engagement (likes or retweets). Key observations include:
- Frequent Terms: Larger words such as “bachelor,” “design,” “die,” “das,” “der,” and “amp” indicate their higher occurrence.
- Key Topics: “bachelor” for Bachelor’s programs or graduates, “design” for design courses or projects, and “HSLU” (Hochschule Luzern).
- General terms: “schweiz,” “zeigen,” “nicht.”
- Note: The term “amp” appears due to HTML encoding and is not meaningful.
## # A tibble: 6 × 14
## created_at id id_str full_text in_reply_to_screen_n…¹
## <dttm> <dbl> <chr> <chr> <chr>
## 1 2023-01-20 17:17:32 1.62e18 1616469988369469… "Im MSc … <NA>
## 2 2023-01-13 07:52:01 1.61e18 1613790954737074… "Was bew… <NA>
## 3 2023-01-12 19:30:01 1.61e18 1613604227141537… "Was uns… <NA>
## 4 2023-01-12 08:23:00 1.61e18 1613436367169634… "Eine di… <NA>
## 5 2023-01-11 14:00:05 1.61e18 1613158809081450… "Wir gra… <NA>
## 6 2023-01-10 17:06:11 1.61e18 1612843252083834… "Unsere … <NA>
## # ℹ abbreviated name: ¹in_reply_to_screen_name
## # ℹ 9 more variables: retweet_count <int>, favorite_count <int>, lang <chr>,
## # university <chr>, tweet_date <dttm>, tweet_minute <dttm>,
## # tweet_hour <dttm>, tweet_month <date>, timeofday_hour <chr>
## created_at id
## Min. :2009-09-29 14:29:47.0 Min. : 4468752018
## 1st Qu.:2015-01-28 15:07:41.5 1st Qu.: 560439073866000000
## Median :2018-04-13 13:26:56.0 Median : 984754806702000000
## Mean :2017-12-09 15:26:50.7 Mean : 939953703992000000
## 3rd Qu.:2020-10-20 10:34:50.0 3rd Qu.:1318470720360000000
## Max. :2023-01-26 14:49:31.0 Max. :1618607065240000000
## id_str full_text in_reply_to_screen_name
## Length:19575 Length:19575 Length:19575
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## retweet_count favorite_count lang university
## Min. : 0.000 Min. : 0.00 Length:19575 Length:19575
## 1st Qu.: 0.000 1st Qu.: 0.00 Class :character Class :character
## Median : 1.000 Median : 0.00 Mode :character Mode :character
## Mean : 1.289 Mean : 1.37
## 3rd Qu.: 2.000 3rd Qu.: 2.00
## Max. :267.000 Max. :188.00
## tweet_date tweet_minute
## Min. :2009-09-29 00:00:00.00 Min. :2009-09-29 14:29:00.00
## 1st Qu.:2015-01-28 00:00:00.00 1st Qu.:2015-01-28 15:07:00.00
## Median :2018-04-13 00:00:00.00 Median :2018-04-13 13:26:00.00
## Mean :2017-12-09 02:25:45.00 Mean :2017-12-09 15:26:24.68
## 3rd Qu.:2020-10-20 00:00:00.00 3rd Qu.:2020-10-20 10:34:30.00
## Max. :2023-01-26 00:00:00.00 Max. :2023-01-26 14:49:00.00
## tweet_hour tweet_month timeofday_hour
## Min. :2009-09-29 14:00:00.00 Min. :2009-09-01 Length:19575
## 1st Qu.:2015-01-28 14:30:00.00 1st Qu.:2015-01-01 Class :character
## Median :2018-04-13 13:00:00.00 Median :2018-04-01 Mode :character
## Mean :2017-12-09 14:59:43.81 Mean :2017-11-24
## 3rd Qu.:2020-10-20 10:00:00.00 3rd Qu.:2020-10-01
## Max. :2023-01-26 14:00:00.00 Max. :2023-01-01
Data Manipulation
In this area we will prepare the data for analysis.
Languages
Here we calculate the frequency of each language present in the
tweets dataset and sort these frequencies in descending order.
The
output indicates that German (de) is the most common language with
14,474 occurrences, followed by Italian (it) with 1,865 and French (fr)
with 1,792. English (en) comes next with 1,280 tweets. The frequencies
of other languages, including rare and less commonly used ones, are also
listed, showcasing the linguistic diversity in the dataset.
# Count the frequency of each language
lang_counts <- table(tweets$lang)
# Sort the language frequencies in descending order
sort(lang_counts, decreasing = TRUE)
##
## de it fr en qam qme es ca da ro nl in et
## 14474 1865 1792 1280 35 21 19 10 10 10 9 7 6
## und pt zxx art lv cy fi lt no qht cs eu ht
## 6 4 4 3 3 2 2 2 2 2 1 1 1
## ja sv tl tr
## 1 1 1 1
German, Italian, French and English are by far the most frequent
languages. The other languages occur only rarely and are not among
the most spoken languages in Switzerland, so we limit the dataset
to these four languages.
# Filter the DataFrame to keep only tweets in German, Italian, French and English
filtered_tweets <- tweets[tweets$lang %in% c("de", "it", "fr", "en"), ]
# Check the resulting language distribution
table(filtered_tweets$lang)
##
## de en fr it
## 14474 1280 1792 1865
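As a quick check of how much data the language filter keeps, the counts from the frequency table can be summed directly. In this sketch the four main counts are copied from the output above, and the single "other" entry is derived as the total of 19,575 minus those four counts.

```r
# Language counts copied from the frequency table; the rare languages
# are lumped into one derived "other" entry (19575 - 19411 = 164)
lang_counts <- c(de = 14474, it = 1865, fr = 1792, en = 1280, other = 164)

kept_langs <- c("de", "it", "fr", "en")

# Number of tweets kept by the filter and their share of the full dataset
n_kept     <- sum(lang_counts[kept_langs])
share_kept <- n_kept / sum(lang_counts)

n_kept                       # 19411
round(100 * share_kept, 2)   # 99.16
```

The filter therefore removes 164 tweets, less than 1% of the dataset.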
This gives us the new summary of the dataset:
- Number of Records: The total count of tweets has decreased from 19,575 to 19,411, because the 164 tweets in other languages were filtered out.
- Date and Time: The median and mean shift by only a few days.
- Other Attributes: No significant changes are observed in the ranges.
## created_at id
## Min. :2009-09-29 14:29:47.00 Min. : 4468752018
## 1st Qu.:2015-02-04 11:39:32.00 1st Qu.: 562923403041000000
## Median :2018-04-17 13:53:07.00 Median : 986210946744999936
## Mean :2017-12-11 15:27:49.55 Mean : 940675313339000064
## 3rd Qu.:2020-10-20 11:09:15.50 3rd Qu.:1318479385120000000
## Max. :2023-01-26 14:49:31.00 Max. :1618607065240000000
## id_str full_text in_reply_to_screen_name
## Length:19411 Length:19411 Length:19411
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## retweet_count favorite_count lang university
## Min. : 0.000 Min. : 0.000 Length:19411 Length:19411
## 1st Qu.: 0.000 1st Qu.: 0.000 Class :character Class :character
## Median : 1.000 Median : 0.000 Mode :character Mode :character
## Mean : 1.293 Mean : 1.376
## 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :267.000 Max. :188.000
## tweet_date tweet_minute
## Min. :2009-09-29 00:00:00.0 Min. :2009-09-29 14:29:00.00
## 1st Qu.:2015-02-04 00:00:00.0 1st Qu.:2015-02-04 11:39:00.00
## Median :2018-04-17 00:00:00.0 Median :2018-04-17 13:53:00.00
## Mean :2017-12-11 02:26:53.7 Mean :2017-12-11 15:27:23.56
## 3rd Qu.:2020-10-20 00:00:00.0 3rd Qu.:2020-10-20 11:09:00.00
## Max. :2023-01-26 00:00:00.0 Max. :2023-01-26 14:49:00.00
## tweet_hour tweet_month timeofday_hour
## Min. :2009-09-29 14:00:00.00 Min. :2009-09-01 Length:19411
## 1st Qu.:2015-02-04 11:30:00.00 1st Qu.:2015-02-01 Class :character
## Median :2018-04-17 13:00:00.00 Median :2018-04-01 Mode :character
## Mean :2017-12-11 15:00:42.28 Mean :2017-11-26
## 3rd Qu.:2020-10-20 10:30:00.00 3rd Qu.:2020-10-01
## Max. :2023-01-26 14:00:00.00 Max. :2023-01-01
Emojis
The package emo is used for emoji analysis in R, which
is essential for text data that includes emojis. This is useful for
cleaning data, extracting information, or preparing text for further
analysis.
Understanding the prevalence of emojis can help analyze
sentiment, user engagement, or cultural trends in social media data.
# Install the emo package from GitHub for emoji analysis
if (!require("emo")) {
  remotes::install_github("hadley/emo")
}
## Loading required package: emo
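To illustrate what emoji extraction produces, here is a rough base R approximation that does not use the emo package. The tweet texts, the helper `extract_emojis`, and the single Unicode range are all assumptions made for this sketch; a dedicated emoji package covers far more cases.

```r
# Hypothetical tweet texts standing in for the real text column
texts <- c(
  "Graduation day \U0001F393 \U0001F389",
  "New study \U0001F4BB",
  "No emoji here"
)

# Keep only code points inside one common emoji block (U+1F300 to U+1FAFF).
# This is a rough approximation: flags, keycaps, and older symbols are missed.
extract_emojis <- function(text) {
  code_points <- utf8ToInt(text)
  is_emoji <- code_points >= 0x1F300 & code_points <= 0x1FAFF
  intToUtf8(code_points[is_emoji], multiple = TRUE)
}

# One character vector of emojis per tweet, analogous to a list column
emojis <- lapply(texts, extract_emojis)

# Overall emoji frequencies, most frequent first
sort(table(unlist(emojis)), decreasing = TRUE)
```

The per-tweet list has the same shape as the emojis column that the frequency tables later in this report are built from with table(unlist(...)).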
Tweet Analysis
In this section we will use the prepared data to analyze the tweets for frequency, interactions and universities.
Tweet Frequency Analysis
In this section we analyze how frequently the Swiss universities tweet.
Tweet Frequency over Time
The bar chart shows, for each university, fluctuations in tweet volumes over the years:
- Universities like HSLU and ZHAW: Display prominent peaks at certain intervals, possibly indicating targeted social media campaigns or significant events that engaged the university community.
- Other Universities (e.g., BFH, FHNW): Some show a steady level of activity with occasional spikes, while others might exhibit a decline or increase in activity, suggesting changes in social media strategy or external factors impacting engagement.
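The monthly counts behind such a chart can be sketched in base R. The toy data frame below is hypothetical, and the aggregation is a simplified stand-in for the grouped summary used in this report.

```r
# Hypothetical toy data: one row per tweet
toy_tweets <- data.frame(
  university = c("hslu", "hslu", "zhaw", "hslu", "zhaw"),
  created_at = as.POSIXct(
    c("2022-03-01 09:00:00", "2022-03-15 12:30:00", "2022-03-20 08:00:00",
      "2022-04-02 10:00:00", "2022-04-05 16:45:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Label every tweet with its calendar month
toy_tweets$tweet_month <- format(toy_tweets$created_at, "%Y-%m", tz = "UTC")

# Count tweets per university and month
monthly_counts <- aggregate(
  count ~ university + tweet_month,
  data = transform(toy_tweets, count = 1L),
  FUN = sum
)
monthly_counts
```

Each row of `monthly_counts` corresponds to one bar in a dodged bar chart of tweets per month and university.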
# Code to analyze tweet frequencies by time and university
p1 <- filtered_tweets %>%
  mutate(tweet_month = floor_date(created_at, "month")) %>%
  group_by(university, tweet_month) %>%
  summarize(count = n(), .groups = 'drop') %>%
  ggplot(aes(x = tweet_month, y = count, fill = university)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Monthly Tweet Frequency by University", x = "Year", y = "Number of Tweets")
# Convert to interactive plotly object
# (no "text" aesthetic is defined above, so the default tooltip is used)
interactive_plot <- ggplotly(p1)
# Optionally, add configurations to enhance interaction
interactive_plot <- interactive_plot %>% layout(
  hovermode = 'closest',
  title = "Click on a University to see its Tweet Trends",
  showlegend = TRUE
)
interactive_plot
Tweet Frequency - Terms
Here we return the terms that meet the high-frequency threshold.
Text Preprocessing
We create a text corpus from filtered_tweets$clean_text,
where each tweet is treated as a separate document.
The corpus
serves as the foundational structure for text analysis, allowing for
uniform processing and manipulation of the text data.
# Corpus: Collection of text documents that generally serves as a basis for analysis in text processing and text mining.
# VectorSource(tweets): This vector is then used as the source for the corpus, whereby each entry in the vector becomes a separate document in the corpus.
# It is important that the text is extracted, as the corpus should only work with text data.
corpus <- Corpus(VectorSource(filtered_tweets$clean_text))
Here we clean the corpus by converting all text to lowercase,
removing punctuation, numbers, and stopwords from German, French,
Italian, and English, stripping extra spaces, and stemming the words.
A final step removes leftover symbols and web-related terms.
Cleaning
the text is crucial for reducing noise and focusing analyses on
meaningful words only. This standardizes the text data, making
subsequent analyses like topic modeling or sentiment analysis more
effective and less prone to error due to textual inconsistencies.
# Clean text
corpus <- tm_map(corpus, content_transformer(tolower)) # Convert to lower case
corpus <- tm_map(corpus, removePunctuation) # Removing punctuation marks
corpus <- tm_map(corpus, removeNumbers) # Removing numbers
corpus <- tm_map(corpus, removeWords, stopwords("german")) # Removing stop words
corpus <- tm_map(corpus, removeWords, stopwords("french"))
corpus <- tm_map(corpus, removeWords, stopwords("italian"))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace) # Removal of additional spaces
corpus <- tm_map(corpus, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus <- tm_map(corpus, content_transformer(function(x) {
x <- gsub("–", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE) # Remove standalone 'amp' left over from HTML-encoded '&' without altering words that contain it
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
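The order of the cleaning steps matters: a URL pattern that looks for "://" can only match while punctuation is still present. The following base R sketch shows one possible ordering on a single hypothetical tweet. The helper `clean_tweet` is an illustration only; it is not part of the pipeline above and omits stopword removal and stemming.

```r
# Hypothetical helper illustrating one possible order of cleaning steps
clean_tweet <- function(text) {
  text <- tolower(text)
  text <- gsub("https?://\\S+", " ", text)         # URLs first, while "://" still exists
  text <- gsub("&amp;", " ", text, fixed = TRUE)   # HTML-encoded ampersand
  text <- gsub("[[:punct:]]+", " ", text)          # punctuation
  text <- gsub("[[:digit:]]+", " ", text)          # numbers
  text <- gsub("\\brt\\b", " ", text)              # retweet marker as a whole word
  trimws(gsub("\\s+", " ", text))                  # collapse repeated spaces
}

clean_tweet("RT Neue Studie &amp; Campus-Tag 2023: https://t.co/abc123")
# returns "neue studie campus tag"
```

Because the ampersand entity is removed as a whole before punctuation is stripped, a word such as "campus" keeps its letters.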
Here we create a Document-Term Matrix (DTM) from the corpus,
applying additional filters like punctuation removal and stopword
exclusion during the matrix formation. Then we filter out terms that
appear in less than 1% of the documents to reduce sparsity.
Reducing sparsity helps focus on terms that have significant presence
across documents, enhancing the reliability and performance of
statistical models and algorithms applied later.
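The effect of a sparsity threshold can be illustrated on a tiny matrix in base R. This sketch assumes that a term is kept when it occurs in more than (1 - sparse) of the documents, which is how the threshold is described above; the toy matrix, its terms, and the value 0.7 are hypothetical.

```r
# Hypothetical toy document-term matrix: 4 documents (rows), 3 terms (columns)
dtm_toy <- matrix(
  c(1, 0, 2,
    1, 0, 0,
    3, 0, 0,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("schweiz", "selten", "digital"))
)

# Share of documents in which each term occurs at least once
doc_share <- colMeans(dtm_toy > 0)

# Keep terms that occur in more than (1 - sparse) of the documents;
# 0.7 is used here only because the toy matrix is tiny
sparse <- 0.7
dtm_reduced <- dtm_toy[, doc_share > (1 - sparse), drop = FALSE]
colnames(dtm_reduced)
```

With sparse = 0.99, as in this report, the same rule keeps terms that occur in more than 1% of the tweets.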
# Create DTM and remove sparse terms
dtm1 <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm1 <- removeSparseTerms(dtm1, sparse = 0.99) # Adjust sparsity threshold as needed
Analyze Terms:
- Dominant Themes: Words like “schweizer” (Swiss), “unternehmen” (companies), “zukunft” (future), “innov” (innovation), and “digital” suggest that the text data heavily revolves around themes of Swiss companies, innovation, and digital advancements.
- Common Words: Frequent appearance of terms like “dank” (thanks), “neue” (new), “mehr” (more), and “info” indicate common communication patterns possibly related to news dissemination or updates about new developments and initiatives.
set.seed(123)
# Term frequencies are the column sums of the DTM (rows are documents, columns are terms)
word_freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)
top_word_freq1 <- head(word_freq1, 80)
# Take the names from the same vector so words and frequencies stay aligned
word_names1 <- names(top_word_freq1)
# Generate word cloud using the correct word names
wordcloud(
  words = word_names1,
  freq = top_word_freq1,
  max.words = 80,
  scale = c(4, 0.5), # Control for size of the most and least frequent words
  random.order = FALSE, # Higher frequency words appear first
  rot.per = 0.25, # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
Tweets Frequency - Emojis
- Engagement Strategy: The frequent use of directional emojis like ➡️, ⤵️, and 👉 suggests that guiding readers to additional content or important links is a common strategy.
- Content Themes: Emojis like 📖, 🔎, 💻, and 💡 highlight the focus on education, research, and technology.
- Celebratory Communication: Emojis such as 👏, 🎉, 🎓, and 🥳 signify celebration and achievement.
# Analyze the frequency of different emojis and select the top 50
emoji_freq2 <- table(unlist(filtered_tweets$emojis))
sort(emoji_freq2, decreasing = TRUE)[1:50]
##
## ➡️ ⤵️ 👉 📖 🔎 🇨🇭 💡 💻 👏 🎉 📣 🚀 ✨ 🎬 💛 🔬
## 414 247 180 117 97 75 67 67 65 63 57 56 45 44 38 36
## 🤖 🆕 📅 🖤 🚨 🎓 🎙️ 🎄 😃 🏆 👇 📸 👍 💪 ⚡ 🌱
## 36 35 32 32 32 30 28 26 26 25 23 23 22 22 21 21
## 👩🎓 ▶️ 🌍 🏅 ☀️ 👨🎓 🙌 🌳 🥂 🥳 🍾 📝 📢 🔋 😉 🤝
## 21 20 20 20 19 19 19 18 18 18 17 17 17 17 17 17
## 📚 😎
## 16 16
High Engagement
In this section, we want to focus on tweets that have attracted more attention and interaction.
High Engagement - Terms
Text Preprocessing:
This section sets a variable engagement_threshold to 20,
which is used as the minimum number of likes or retweets a tweet must
have to be considered as having “high engagement”. This threshold helps
to focus on tweets that have garnered more attention and
interaction.
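The threshold rule can be sketched on toy data in base R. The data frame and its values are hypothetical; the point is that a tweet qualifies when either count reaches the threshold.

```r
# Hypothetical toy data with the two engagement columns
toy_tweets <- data.frame(
  id_str         = c("a", "b", "c", "d"),
  favorite_count = c(25L, 3L, 0L, 20L),
  retweet_count  = c(1L, 40L, 2L, 20L),
  stringsAsFactors = FALSE
)

engagement_threshold <- 20

# A tweet qualifies if EITHER count reaches the threshold
is_high <- toy_tweets$favorite_count >= engagement_threshold |
  toy_tweets$retweet_count >= engagement_threshold

high_engagement_toy <- toy_tweets[is_high, ]
high_engagement_toy$id_str   # "a" "b" "d"
```

Tweet "c" is dropped because neither of its counts reaches 20, while "d" is kept because the comparison includes the threshold itself.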
# Set a threshold for "high engagement" (e.g., tweets with at least 20 likes or retweets)
engagement_threshold <- 20
# Filter tweets based on this engagement threshold
high_engagement_tweets <- filtered_tweets %>%
  filter(favorite_count >= engagement_threshold | retweet_count >= engagement_threshold)
For the high_engagement_tweets we also clean the corpus
by converting all text to lowercase, removing punctuation, numbers, and
stopwords from German, French, Italian, and English, stripping extra
spaces, and stemming the words. We then create a Document-Term Matrix (DTM) from
this corpus.
# Rebuild the corpus with the sampled data
corpus2 <- Corpus(VectorSource(high_engagement_tweets$clean_text))
corpus2 <- tm_map(corpus2, content_transformer(tolower)) # Convert to lower case
corpus2 <- tm_map(corpus2, removePunctuation) # Removing punctuation marks
corpus2 <- tm_map(corpus2, removeNumbers) # Removing numbers
corpus2 <- tm_map(corpus2, removeWords, stopwords("german")) # Removing stop words
corpus2 <- tm_map(corpus2, removeWords, stopwords("french"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("italian"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
corpus2 <- tm_map(corpus2, stripWhitespace) # Removal of additional spaces
corpus2 <- tm_map(corpus2, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus2 <- tm_map(corpus2, content_transformer(function(x) {
x <- gsub("–", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE) # Remove standalone 'amp' left over from HTML-encoded '&' without altering words that contain it
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
# Create DTM and remove sparse terms
dtm <- DocumentTermMatrix(corpus2, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm <- removeSparseTerms(dtm, sparse = 0.99) # Adjust sparsity threshold as needed
Text Analysis:
The word cloud illustrates which topics are most engaging
among tweets with at least 20 likes or retweets. This
visualization can help in refining the communication and engagement
strategies by focusing on the topics that naturally engage the
audience.
- “forscherteam” (research team) and “entwickelt” (developed): suggest a strong emphasis on research and development topics.
- “lab”: indicates discussions possibly related to laboratory work or scientific studies.
- “data” and “digital”: reflect a focus on digital technologies and data science, crucial in contemporary research and education.
- “open”: could relate to open source, open access, or openness in research and education, pointing towards transparency and accessibility in academic resources.
- “nein” (no) and “wieso” (why): might indicate debates or discussions, possibly questioning certain methods or findings.
- “schweizer” (Swiss): identifies the national or cultural context, implying that the content is likely relevant to or originating from Swiss institutions or discussing Swiss innovations.
- “gespräch” (conversation): underscores the interactive or dialogical nature of the tweets, suggesting that engagement may be driven by conversational or discursive posts.
- Not well cleaned elements: Strings like “http” are most likely remnants of URLs. Punctuation is stripped before the URL pattern is applied, so the pattern no longer matches and fragments of the links remain. Although not directly meaningful, they indicate the inclusion of links or specific calls to action in the tweets.
set.seed(123)
# Term frequencies are the column sums of the DTM (rows are documents, columns are terms)
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
top_word_freq <- head(word_freq, 80)
# Take the names from the same vector so words and frequencies stay aligned
word_names <- names(top_word_freq)
# Generate word cloud using the correct word names
wordcloud(
  words = word_names,
  freq = top_word_freq,
  max.words = 80,
  scale = c(4, 0.5), # Control for size of the most and least frequent words
  random.order = FALSE, # Higher frequency words appear first
  rot.per = 0.25, # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
High Engagement - Emojis
Most Frequent Emojis:
- Utility and Guidance: Directional emojis like ➡️ and 👉 suggest that providing clear guidance or calls to action within tweets is effective in garnering engagement.
- Cultural and International Appeal: The presence of multiple national flags suggests that these tweets are connected to specific national contexts or international discussions.
- Emotional and Informative Content: Emojis like ✨ (sparkles) and 💛 (heart) are often used to add emotional depth or positivity to tweets. Similarly, 📅 (calendar) and 📢 (megaphone) likely denote event-related or important announcements that command attention.
# Analyze the frequency of different emojis
emoji_freq1 <- table(unlist(high_engagement_tweets$emojis))
sort(emoji_freq1, decreasing = TRUE)
##
## ➡️ 🇨🇭 ⤵️ ✨ 🇨🇳 🇬🇧 🇳🇱 🇸🇪 🇸🇬 👉 💛 📅 📢 🗞️ 😀 😉 🚊 🚨
## 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Engagement Analysis by University
The bar chart visualizes the total likes accumulated by each
university among tweets with at least 20 likes or retweets,
highlighting variations in engagement across these institutions on
social media.
The visualization clearly shows which universities
are receiving the most engagement in terms of likes. HSLU (Lucerne
University of Applied Sciences and Arts) and ZHAW (Zurich University of
Applied Sciences) stand out with the highest engagement, significantly
more than other institutions. Institutions like HSLU and ZHAW therefore
offer a model that others can use to refine their social media tactics.
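The totals behind such a ranking can be sketched in base R. The toy data and its values are hypothetical and only illustrate summing likes and retweets per university and ordering by total likes.

```r
# Hypothetical toy data: high-engagement tweets from three universities
toy_tweets <- data.frame(
  university     = c("hslu", "zhaw", "hslu", "bfh"),
  favorite_count = c(30L, 25L, 22L, 21L),
  retweet_count  = c(5L, 20L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Sum likes and retweets per university
totals <- aggregate(
  cbind(favorite_count, retweet_count) ~ university,
  data = toy_tweets,
  FUN = sum
)

# Rank universities by total likes, highest first
totals <- totals[order(totals$favorite_count, decreasing = TRUE), ]
totals
```

Total likes favor accounts with many qualifying tweets, so a per-tweet average may be worth comparing alongside the totals.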
# Analysis of likes and retweets
high_engagement_tweets %>%
group_by(university) %>%
summarize(total_likes = sum(favorite_count), total_retweets = sum(retweet_count), .groups = 'drop') %>%
ggplot(aes(x = reorder(university, total_likes), y = total_likes)) +
geom_col() +
coord_flip() +
  labs(title = "Engagement Analysis by University", x = "University", y = "Total Likes")
HSLU & ZHAW Engagement Analysis
In this area, we will analyze the universities HSLU (Lucerne
University of Applied Sciences and Arts) and ZHAW (Zurich University of
Applied Sciences) to find out why they have significantly more
interactions compared to other universities.
Text Preprocessing:
For this we must again take text preprocessing measures, as in the previous analyses.
#Filter Tweets for HSLU and ZHAW
hslu_zhaw_tweets <- filtered_tweets %>%
  filter(tolower(university) %in% c("hslu", "zhaw")) # Compare in lower case so that mixed-case codes still match
# Set a threshold for "high engagement" (e.g., tweets with at least 10 likes or retweets)
engagement_threshold1 <- 10
# Filter tweets based on this engagement threshold
hslu_zhaw_high_engagement_tweets <- hslu_zhaw_tweets %>%
filter(favorite_count >= engagement_threshold1 | retweet_count >= engagement_threshold1)
# Rebuild the corpus with the sampled data
corpus3 <- Corpus(VectorSource(hslu_zhaw_high_engagement_tweets$clean_text))
corpus3 <- tm_map(corpus3, content_transformer(tolower)) # Convert to lower case
corpus3 <- tm_map(corpus3, removePunctuation) # Removing punctuation marks
corpus3 <- tm_map(corpus3, removeNumbers) # Removing numbers
corpus3 <- tm_map(corpus3, removeWords, stopwords("german")) # Removing stop words
corpus3 <- tm_map(corpus3, removeWords, stopwords("french"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("italian"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("english"))
corpus3 <- tm_map(corpus3, stripWhitespace) # Removal of additional spaces
corpus3 <- tm_map(corpus3, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus3 <- tm_map(corpus3, content_transformer(function(x) {
x <- gsub("–", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE) # Remove standalone 'amp' left over from HTML-encoded '&' without altering words that contain it
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
# Create DTM and remove sparse terms
dtm2 <- DocumentTermMatrix(corpus3, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm2 <- removeSparseTerms(dtm2, sparse = 0.99) # Adjust sparsity threshold as needed
Text Analysis:
The word cloud illustrates which topics are most engaging among
tweets with at least 10 likes or retweets.
- Topic Relevance: The engagement could be driven by the relevance of the topics discussed, such as climate goals and economic studies, which are significant areas of interest globally.
- Content Quality: The use of terms like “studie” and “beitrag” suggests high-quality content that offers value through educational insights or research findings, which is typically well-received in academic communities.
- Clear and Actionable Messages: The presence of terms related to achievements and actions indicates that clear, actionable content tends to perform well, as it likely inspires and motivates the audience.
set.seed(123)
# Term frequencies are the column sums of the DTM (rows are documents, columns are terms)
word_freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
top_word_freq2 <- head(word_freq2, 80)
# Take the names from the same vector so words and frequencies stay aligned
word_names2 <- names(top_word_freq2)
# Generate word cloud using the correct word names
wordcloud(
  words = word_names2,
  freq = top_word_freq2,
  max.words = 80,
  scale = c(4, 0.5), # Control for size of the most and least frequent words
  random.order = FALSE, # Higher frequency words appear first
  rot.per = 0.25, # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
Most Frequent Emojis:
- Guidance and Clarity: Directional emojis like 👉 and ➡️ are crucial in navigating users through content effectively
- Thematic Relevance: Emojis related to specific themes (e.g., 🌍 for global issues, 💻 for technology) help in visually categorizing the content
- Emotional Connection: Emojis that convey emotions or actions (e.g., 💛, 💬) can humanize the content
# Analyze the frequency of different emojis
emoji_freq <- table(unlist(hslu_zhaw_high_engagement_tweets$emojis))
sort(emoji_freq, decreasing = TRUE)
##
## 🇨🇭 👉 🎉 ➡️ 🌼 🍾 ✨ ⚖️ 🌍 🌞 🌳 🌸 🎁 🎄 🐰 👋 💛 💬 💻 📈 📢 📫 📰 📺 🔘 😀
## 4 4 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 😉 😎 🤔 🥂 🥑 🥗 🥳
## 1 1 1 1 1 1 1
BFH Frequency Analysis
In this section, we will analyze BFH (Bern University of Applied
Sciences) to find out which terms and emojis it uses most often in its tweets.
Text Preprocessing:
For this we must again take text preprocessing measures, as in the previous analyses.
# Filter tweets for BFH
bfh_tweets <- filtered_tweets %>%
filter(university %in% "bfh")
# Rebuild the corpus with the sampled data
corpus4 <- Corpus(VectorSource(bfh_tweets$clean_text))
corpus4 <- tm_map(corpus4, content_transformer(tolower)) # Convert to lower case
corpus4 <- tm_map(corpus4, removePunctuation) # Removing punctuation marks
corpus4 <- tm_map(corpus4, removeNumbers) # Removing numbers
corpus4 <- tm_map(corpus4, removeWords, stopwords("german")) # Removing stop words
corpus4 <- tm_map(corpus4, removeWords, stopwords("french"))
corpus4 <- tm_map(corpus4, removeWords, stopwords("italian"))
corpus4 <- tm_map(corpus4, removeWords, stopwords("english"))
corpus4 <- tm_map(corpus4, stripWhitespace) # Removal of additional spaces
corpus4 <- tm_map(corpus4, stemDocument) #remove suffixes, etc.; only root form of the word
# Further clean the text by removing specific web/text symbols and terms
corpus4 <- tm_map(corpus4, content_transformer(function(x) {
x <- gsub("–", "", x)
x <- gsub("…", "", x)
x <- gsub("«", "", x)
x <- gsub("»", "", x)
x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE) # Remove 'rt', 'www', and 'emojiemoji'
x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE) # Remove standalone 'amp' left over from HTML-encoded '&' without altering words that contain it
x <- gsub("http[s]?://\\S+", "", x) # Remove URLs
return(x)
}))
# Create DTM and remove sparse terms
dtm3 <- DocumentTermMatrix(corpus4, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm3 <- removeSparseTerms(dtm3, sparse = 0.99) # Adjust sparsity threshold as needed
Text Analysis:
The word cloud illustrates which topics have the highest frequency.
- Practical and Innovative Focus: Terms like “Praxis” (practice) and “Innov” (innovation) indicate a strong link between academic content and real-world applications, appealing particularly to an audience interested in actionable and cutting-edge information.
- Community and Collaboration: Words such as “zusammen” (together) and “unsere” (our) reflect a community-focused approach, promoting collective efforts and teamwork within the university setting.
- Local Identity and Quality: The mention of “Schweizer” (Swiss) suggests content with a national focus, likely resonating with local pride, while “Qualität” (quality) underscores the university’s commitment to high standards in education and research.
set.seed(123)
# Term frequencies are the column sums of the DTM (rows are documents, columns are terms)
word_freq3 <- sort(colSums(as.matrix(dtm3)), decreasing = TRUE)
top_word_freq3 <- head(word_freq3, 80)
# Take the names from the same vector so words and frequencies stay aligned
word_names3 <- names(top_word_freq3)
# Generate word cloud using the correct word names
wordcloud(
  words = word_names3,
  freq = top_word_freq3,
  max.words = 80,
  scale = c(4, 0.5), # Control for size of the most and least frequent words
  random.order = FALSE, # Higher frequency words appear first
  rot.per = 0.25, # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2") # Enhances visual appeal
)
Most Frequent Emojis:
- Technology and Innovation: The directional emoji 👉, devices such as 💻, and innovation-related emojis (🔋, 🚀, 🤖) dominate, highlighting content on technological advancements and future trends.
- Environmental Themes: Nature-related emojis (🌴, 🌲, 🌳, ♻️) emphasize environmental issues and sustainability efforts.
- Community and Celebrations: Emojis like 🎉 and 👏 are used for celebrations and achievements, fostering community spirit.
- Health and Lifestyle: Emojis like 🥥, 🥦, and 🥕 suggest a focus on health and nutrition.
- Global and Cultural Awareness: Symbols like 🌐 and 🌍, along with the Swiss flag 🇨🇭, point to global awareness and local identity.
# Analyze the frequency of different emojis
emoji_freq3 <- table(unlist(bfh_tweets$emojis))
sort(emoji_freq3, decreasing = TRUE)[1:30]
##
## 👉 🔋 👇 🌴 🌲 🎉 💡 💻 🚀 🤖 🇨🇭 🌳 👏 📅 🥥 🌱
## 49 16 12 11 10 10 10 10 10 10 9 9 9 9 9 8
## 🚗 🥂 ✨ 🌐 ♻️ 🎄 🐝 🥦 ☀️ 🌍 🏡 🐴 👨🎓 🥕
## 8 8 7 7 6 6 6 6 5 5 5 5 5 5